.. _spatial:


Working with spatial data in R
******************************


Prerequisites
=============


**Data Access**


To work with SOEP’s spatial data you need to have a data distribution
contract together with an application for stationary access to a
terminal in the Research Data Center (RDC). The forms can be downloaded
from `here <https://www.diw.de/en/diw_02.c.222843.en/forms.html>`__.
Having data access granted, you can use the IGEL workstations in the
SOEP RDC to get access to the corresponding server infrastructures MORAN
and HAUSER. The MORAN server hosts the coordinates of the SOEP
households together with some fake coordinates that cannot be
distinguished. On this server you can, for example, compute distances or
nearest points as shown later in this tutorial. On the HAUSER server you
are provided your computational outcome form the MORAN server but
without the coordinates. The coordinates of the SOEP households **must
never** be available together with SOEP survey data. More information on
how to use the SOEP IGEL technology can be found
`here <http://companion.soep.de/Working%20with%20SOEP%20Data/Working%20with%20SOEP%20IGEL.html>`__.


**R**


When working with SOEP’s spatial data at the RDC using SOEP IGEL we
provide you with the software you need. If you want to work with spatial
test data before, you need a certain software setup. To be able to work
yourself through the following examples you need to have
`R <https://cran.r-project.org/>`__ and
`RStudio <https://rstudio.com/>`__ installed. Further, for working the
spatial data in R, you need to have GEOS, GDAL, and PROJ installed, too.
When running a Windows OS, simply install the Rtools for Windows from
`here <https://cran.r-project.org/bin/windows/Rtools/>`__. MacOS and
Linux users are referred to the website of the ``sf`` package for
installation instructions. You can find them
`here <https://r-spatial.github.io/sf/index.html>`__.

If you are not yet familiar with R we recommend the book *R for Data
Science* by Wickham and Grolemund which is available
`here <https://r4ds.had.co.nz/transform.html>`__ for free. Besides that,
there two freely available books which are helpful to get you started
with spatial data in R:

-  *Spatial Data Science* by Pebesma and Bivand which is available
   `here <https://keen-swartz-3146c4.netlify.app/>`__
-  *Geocomputation with R* by Lovelace, Nowosad, and Muenchow which is
   available `here <https://geocompr.robinlovelace.net/>`__


**Coordinate Reference Systems**


Any location of an object can be characterised by its coordinates.
Besides a coordinate the altitude of an object might be of interest. So
locations can be given in 2-dimensional or 3-dimensional space. A
2-dimensional coordinate referenced to two orthogonal axes (say x and y)
can be given by either a pair of values (x,y) or by an angle (with
respect to one of the axes) and a distance. The pair of coordinates
(x,y) is referred to as cartesian coordinates and the latter as polar or
spherical coordinates. In a 3-dimensional world, where you define a
location on a globe, things become a little more complicated, because
earth is an ellipsoid rather than a perfect sphere. In defining a
location in a 3-dimensional coordinate system, the *coordinate reference
system* (CRS) is very important. The CRS combines the information about
the coordinate system itself and a geodetic datum. The geodetic datum
describes how the location of the coordinate system is related to earth.
So different CRS will provide different values of coordinates for the
same location. Finally, the projection method provides information on
how to flatten objects that are on a round surface so they can be viewed
on a flat surface. The CRS is often summarised by an EPSG-code (European
Petroleum Survey Group Geodesy) allowing for convenient transformations
between different CRS. For more details we refer you to Chapter 2 of
*Spatial Data Science* or Chapter 6 of *Geocomputation with R*. The most
frequently used CRS are displayed in the following table.


.. csv-table::
   :header-rows: 1
   :file: csv/relevant_crs.csv


The differences in the maps output can be seen in the following figure.
The maps show Europe (provided by
`https://data.opendatasoft.com <https://data.opendatasoft.com/explore/dataset/european-union-countries%40public/export/?sort=admin&location=5,36.37176,-6.67969&basemap=jawg.sunny&dataChart=eyJxdWVyaWVzIjpbeyJjb25maWciOnsiZGF0YXNldCI6ImV1cm9wZWFuLXVuaW9uLWNvdW50cmllc0BwdWJsaWMiLCJvcHRpb25zIjp7fX0sImNoYXJ0cyI6W3siYWxpZ25Nb250aCI6dHJ1ZSwidHlwZSI6ImNvbHVtbiIsImZ1bmMiOiJBVkciLCJ5QXhpcyI6InNjYWxlcmFuayIsInNjaWVudGlmaWNEaXNwbGF5Ijp0cnVlLCJjb2xvciI6IiMxNDJFN0IifV0sInhBeGlzIjoic292ZXJlaWdudCIsIm1heHBvaW50cyI6NTAsInNvcnQiOiIifV0sInRpbWVzY2FsZSI6IiIsImRpc3BsYXlMZWdlbmQiOnRydWUsImFsaWduTW9udGgiOnRydWV9>`__)
in two different CRS.


.. figure:: png/projection-1.png
   :align: center  
   :alt: Europe displayed in different CRS.


The CRS of the fake SOEP households is defined by the EPSG code 32632.
The CRS usually used by Google Maps / OpenStreetMap is defined by EPSG
code 4326. To check you can use for example
https://coordinates-converter.com/.

If you already know about all of the prerequisites and are familiar with
R, the sf package as well as the tidyverse, you can directly switch to
the complete example provided in section *Complete Example*. If you are
not familiar with all of this, we recommend you go through the following
text first.


Reading data
============


Spatial data comes in a variety of formats, ranging from ASCII-files to
more sophisticated data structures like shapefiles or XML-dialects.
Fortunately the R package
`sf <https://cran.r-project.org/web/packages/sf/index.html>`__ provides
functions to conveniently read most of the data formats. Besides that,
the package and its use is well documented
`here <https://r-spatial.github.io/sf/index.html>`__. The package can
easily be installed and loaded by using the following lines of code in
R.

.. code:: r

   install.packages('sf') # installing the package 
   library(sf) # loading the package

   # further packages 
   install.packages('tidyverse') # a selection of packages 
   library(tidyverse) # (mainly using dplyr, ggplot2)
   install.packages('here') # for working with .RProj and paths
   library(here)

Further packages that are useful when working with R are the tidyverse
(a selection of several packages) and here (makes working in projects
easier). A complete list of packages used to create this page is
provided in Appendix at the end of this page.

When working with spatial data in the SOEP we provide you with fake
coordinates of the SOEP households in a shapefile format. This format is
easy to read using the sf package in R and provides you with all the
details you need. The data you bring along is likely to be in a
different format. Formats that can be read using ``st_read`` are listed
in the `GDAL
Documentation <https://gdal.org/drivers/vector/index.html>`__. If you
bring along data in other formats consider the
`haven <https://cran.r-project.org/web/packages/haven/index.html>`__
package for SAS, SPSS, or Stata formats, documented
`here <https://haven.tidyverse.org/>`__. Non-spatial data formats
however, mostly include no information on the CRS. Thus you need to know
and assign it to the data. An example when reading data from
spreadsheets is provided below.


**Shapefiles**


Shapefiles consist of at least three files, namely

-  ``.shp`` containing the geometries for example for a point (address)
   or a polygon (state)
-  ``.dbf`` containing the attribute data for each geometry, for exampe,
   the address data (street, zip code, city) or names of the states
-  ``.shx`` containing the indices linking geometries and attribute data

There are usually some more files provided containing further
information. One example of shapefiles is provided by the Federal Agency
for Cartography and Geodesy (`BKG <https://www.bkg.bund.de>`__,
GeoBasis-DE / BKG 2019). The BKG provides public available data on
administrative areas in Germany. The dataset ``VG250`` includes theses
administrative units of the hierarchical levels from the country down to
municipalities and is available
`here <https://gdz.bkg.bund.de/index.php/default/open-data/verwaltungsgebiete-1-250-000-ebenen-stand-01-01-vg250-ebenen-01-01.html>`__.
Unzipping the folder will provide you with the shapefiles and the
documentation of the data. Because the data contains information on
different levels (Germany, its states, district, municipalities) these
are stored in different layers. To check which information is avaialable
in the data, use ``st_layers``. If you do not explicitly provide a
layer, ``st_read`` will take the first available layer. The function
returns the layer names, the according type of geometry, the number of
features (geometries) and fields (variables). Reading the German Federal
States you need to pick the layer ``VG250_LAN``.

.. code:: r

   # layers in the data
   st_layers(here('Daten', 'vg250_ebenen_0101'))

::

   ## Driver: ESRI Shapefile 
   ## Available layers:
   ##    layer_name geometry_type features fields
   ## 1   VG250_GEM       Polygon    11135     26
   ## 2   VG250_KRS       Polygon      431     26
   ## 3   VG250_LAN       Polygon       35     26
   ## 4    VG250_LI   Line String    35483      4
   ## 5    VG250_PK         Point    11003     13
   ## 6   VG250_RBZ       Polygon       20     26
   ## 7   VG250_STA       Polygon       11     26
   ## 8   VG250_VWG       Polygon     4688     26
   ## 9    VG_DATEN            NA       35      8
   ## 10   VG_WERTE            NA       32      4

.. code:: r

   # read the shapefile for Federal States
   States <- st_read(here('Daten', 'vg250_ebenen_0101'), layer = 'VG250_LAN', quiet = TRUE)

Taking a look at a the main variables ``GEN`` (Geographical name),
``BEZ`` (Designation), and ``geometry`` (a multipolygon), we are
provided with information displayed differently from usual
``data.frame`` or ``tibble`` like data. The ‘header’ tells you it’s a
collection of simple features having 16 features (rows) and 2 fields
(variables, excluding the geometry column). The type of geomtry is a
multipolygon (set of polygons) in a two-dimensional space (XY). The
bounding box (``bbox``) provides information on the bottom left and top
right corner of the box bounding the geometries. Finally, you get
information on the coordinate reference system (``CRS``), summarized by
EPSG-Code 25832.

.. code:: r

   # look at the data
   States %>% 
     filter(GF == 4) %>% # use land with structure only
     select(GEN, BEZ, geometry) # select necessary variables

::

   ## Simple feature collection with 16 features and 2 fields
   ## geometry type:  MULTIPOLYGON
   ## dimension:      XY
   ## bbox:           xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6101444
   ## projected CRS:  ETRS89 / UTM zone 32N
   ## First 10 features:
   ##                    GEN                  BEZ                       geometry
   ## 1   Schleswig-Holstein                 Land MULTIPOLYGON (((464810.7 61...
   ## 2              Hamburg Freie und Hansestadt MULTIPOLYGON (((578219 5954...
   ## 3        Niedersachsen                 Land MULTIPOLYGON (((479451.1 59...
   ## 4               Bremen     Freie Hansestadt MULTIPOLYGON (((466930.3 58...
   ## 5  Nordrhein-Westfalen                 Land MULTIPOLYGON (((477607.3 58...
   ## 6               Hessen                 Land MULTIPOLYGON (((534242 5721...
   ## 7      Rheinland-Pfalz                 Land MULTIPOLYGON (((416304.5 56...
   ## 8    Baden-Württemberg                 Land MULTIPOLYGON (((546771.2 55...
   ## 9               Bayern            Freistaat MULTIPOLYGON (((609387.6 52...
   ## 10            Saarland                 Land MULTIPOLYGON (((359841 5499...


**SOEP Regional Data**


Also the fake coordinates for the SOEP households are provided in a
shapefile format. The data contains an ID-variable (``ID``) and the
survey year (``erhebj``) together with a point geometry. The ``ID``
provided is not related to any of the SOEP’s ``hid`` or ``pid``. It is
simply a number ranging from 1 to the corresponding number of rows in
the data set. Households can thus only be identified, merged, or
selected using their coordinate. Moreover the data also contain the
SOEP’s initial sample. Thus containing regional information on
respondents and non-respondents. Because of data protection regulations,
the SOEP survey data must not be used together with the household’s
coordinates. This is why the environment where you are allowed to work
with the household’s coordinates is strictly separated. Unlike the
``VG250`` data the SOEP data has the EPSG-code 32632.

.. code:: r

   SOEP <- st_read(here('Daten', 'soep_v29'))

::

   ## Reading layer `soep_hh_korr_nur_fakes_utm32-v29' from data source `/media/H/Projekte/Adhoc_Analysen/GeodatenBSP/Daten/soep_v29' using driver `ESRI Shapefile'
   ## Simple feature collection with 209734 features and 2 fields
   ## geometry type:  POINT
   ## dimension:      XY
   ## bbox:           xmin: 288107 ymin: 5250939 xmax: 888152 ymax: 6055336
   ## projected CRS:  WGS 84 / UTM zone 32N

.. code:: r

   SOEP

::

   ## Simple feature collection with 209734 features and 2 fields
   ## geometry type:  POINT
   ## dimension:      XY
   ## bbox:           xmin: 288107 ymin: 5250939 xmax: 888152 ymax: 6055336
   ## projected CRS:  WGS 84 / UTM zone 32N
   ## First 10 features:
   ##    erhebj ID               geometry
   ## 1    2000  1 POINT (355929 5665928)
   ## 2    2000  2 POINT (800240 5821852)
   ## 3    2000  3 POINT (794464 5831012)
   ## 4    2000  4 POINT (478831 5431318)
   ## 5    2000  5 POINT (799318 5825023)
   ## 6    2000  6 POINT (794752 5833726)
   ## 7    2000  7 POINT (799149 5824960)
   ## 8    2000  8 POINT (791594 5827687)
   ## 9    2000  9 POINT (791657 5827422)
   ## 10   2000 10 POINT (792280 5826345)


**Spreadsheets**


When reading spatial data from spreadsheets the data is read by
different drivers. The example data is contained in an Excel spreadsheet
with three variables ``X`` (longitude), ``Y`` (latitude), and
``Object``. Using the ``st_read`` function the csv-driver from GDAL is
chosen automatically, reading 4 columns however. Thus we get rid of the
last column first.

.. code:: r

   POI <- st_read(here('Daten', 'POIs_Berlin.csv'))

::

   ## Reading layer `POIs_Berlin' from data source `/media/H/Projekte/Adhoc_Analysen/GeodatenBSP/Daten/POIs_Berlin.csv' using driver `CSV'

.. code:: r

   POI

::

   ##                   X                Y                    Objekt field_4
   ## 1  13.3776978296166 52.5162788298363         Brandenburger Tor    <NA>
   ## 2  13.3693019102667 52.5250274843424              Hauptbahnhof    <NA>
   ## 3  13.3886070846104 52.5121727411876                DIW Berlin    <NA>
   ## 4  13.4589278721678  52.496825784817              Mulecule Man    <NA>
   ## 5  13.2887256012712 52.5580119710438           Flughafen Tegel    <NA>
   ## 6  13.4005469632413 52.4792779435797          Tempelhofer Feld    <NA>
   ## 7  13.4484578382279 52.4482123689564          Hufeisensiedlung    <NA>
   ## 8   13.281110837709 52.4473851451014                 FU Berlin    <NA>
   ## 9  13.1725456943567 52.4300072503147                   Wannsee    <NA>
   ## 10 13.2128459704725 52.5411476127425         Zitadelle Spandau    <NA>
   ## 11 13.5727300494297 52.4438163134271          Schloss Köpenick    <NA>
   ## 12 13.2795658406826 52.5080182126806 Zentraler Omnibus Bahnhof    <NA>

.. code:: r

   POI <- POI[, -4]

Because this data contains latitude and longitude already, we can simply
transform it to a spatial dataset using ``st_as_sf`` and the CRS has to
be assigned to it.

.. code:: r

   POI <- st_as_sf(POI, coords = c("X", "Y"), crs = 4326)
   POI

::

   ## Simple feature collection with 12 features and 1 field
   ## geometry type:  POINT
   ## dimension:      XY
   ## bbox:           xmin: 13.17255 ymin: 52.43001 xmax: 13.57273 ymax: 52.55801
   ## geographic CRS: WGS 84
   ## First 10 features:
   ##               Objekt                  geometry
   ## 1  Brandenburger Tor  POINT (13.3777 52.51628)
   ## 2       Hauptbahnhof  POINT (13.3693 52.52503)
   ## 3         DIW Berlin POINT (13.38861 52.51217)
   ## 4       Mulecule Man POINT (13.45893 52.49683)
   ## 5    Flughafen Tegel POINT (13.28873 52.55801)
   ## 6   Tempelhofer Feld POINT (13.40055 52.47928)
   ## 7   Hufeisensiedlung POINT (13.44846 52.44821)
   ## 8          FU Berlin POINT (13.28111 52.44739)
   ## 9            Wannsee POINT (13.17255 52.43001)
   ## 10 Zitadelle Spandau POINT (13.21285 52.54115)


Transformations
===============


To be able to work with all three data sets (``States``, ``SOEP``, and
``POI``) we have to make sure all have the same CRS. Using
``st_transform`` we can reproject the data sets to have the same
EPSG-code (``common_crs``).

.. code:: r

   common_crs <- 25832

   SOEP <- st_transform(SOEP, crs = common_crs)
   SOEP <- SOEP[SOEP$erhebj == 2005, ] # only use 2005 SOEP data

   POI <- st_transform(POI, crs = common_crs)

   # States <- st_transform(States, crs = common_crs) aleady correct crs

Sometimes it might be easier for you to work in other programs and you
wish to have latitude and longitude data. In this case you can use
``st_coordinates`` to transform point geometries into (x,y) coordinates.
The below example first transforms the coordinates for the households 1
to 5 in the ``SOEP`` data into latitude and longitude data and then
creates the (x,y) coordinates from the point geometry.

.. code:: r

   soep5_lat_lon <-  st_transform(SOEP[1:5,], 4326)
   st_coordinates(soep5_lat_lon)

::

   ##          X        Y
   ## 1  6.94659 51.12465
   ## 2 13.36828 52.48428
   ## 3 13.40601 52.39064
   ## 4  8.70727 49.03966
   ## 5 13.43205 52.48811


Plotting Spatial Data
=====================


Besides being able to look up coordinates or objects in google maps or
OpenStreetMap, we can use the ``ggplot2`` package included in the
``tidyverse`` package for displaying the data. The package provides the
``geom_sf`` function to easily plot the data.

.. code:: r

   # subsetting and plotting the data
   States %>% # use the states data
     filter(GEN == 'Berlin') %>% # filter for Berlin only
     ggplot() + # plot base
     geom_sf(fill = 'white') + # add the polygon for Berlin and fill it whtie
     geom_sf(data = POI) + # add the POI data
     geom_sf_text(data = POI, aes(label = Objekt), 
                  nudge_y = -1000, 
                  check_overlap = TRUE) + # overlapping labels will not be displayed
     xlab('') + ylab('')
     

.. figure:: png/plot_data-1.png
   :align: center  
   

Frequently Used Operations
==========================


This section will provide an overview of some frequently used operations
when working with spatial data. The dataset ``SOEP`` contains some fake
coordinates of household addresses we will work with throughout the
examples.


.. _contains:

**Finding Households in a Specified Area**


Suppose we are interested in identifying all the SOEP households located
in Berlin. The corresponding polygon for Berlin is provided in the
``States`` dataset. Because we are interested in Berlin only, we save
the polygon in an own object ``BE``. The function ``st_contains``
identifies the row-index in the ``SOEP`` dataset that fall within the
polygon ``BE`` and returns a list (``soep_in_berlin``). For checks along
the way we can look at plots of the data.

.. code:: r

   BE <- States[States$GEN == 'Berlin', ]
   BE %>% select(GEN, BEZ)

::

   ## Simple feature collection with 1 feature and 2 fields
   ## geometry type:  MULTIPOLYGON
   ## dimension:      XY
   ## bbox:           xmin: 777974.1 ymin: 5808837 xmax: 823510.5 ymax: 5845580
   ## projected CRS:  ETRS89 / UTM zone 32N
   ##       GEN  BEZ                       geometry
   ## 11 Berlin Land MULTIPOLYGON (((802831.7 58...

.. code:: r

   soep_in_berlin <- st_contains(BE, SOEP)
   soep_in_berlin

::

   ## Sparse geometry binary predicate list of length 1, where the predicate was `contains'
   ##  1: 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, ...

.. code:: r

   SOEP_BE <- SOEP[unlist(soep_in_berlin), ]

   BE %>% 
     ggplot() + geom_sf(fill = 'white') + geom_sf(data = SOEP_BE)


.. figure:: png/soep_in_BE-1.png
   :align: center  
   

.. _distances:

**Distances / Areas**


To compute distances the function ``st_distance`` can be provided a
single dataset with n rows providing a n x n matrix of distances of the
geometries contained in the data (``dist_m``). The unit of the distance
returned depends on the CRS. In the example provided below the distances
of the POIs to the location of the DIW are given in meters.

Providing the function a second dataset of m rows will create a n x m
distance matrix (``dist_soep_poi``). The created object is a matrix with
529 rows (SOEP households in Berlin) and 12 columns (POIs in Berlin).

When computing distances on large data sets it might be helpful to
subset the data, because the distance of a household in Munich might be
irrelevant to a research question focusing on Berlin or distances
smaller than 5000m. According the the CRS, consider specifying the
``which`` argument.

Areas can be computed for (multi-)polygons. The function ``st_area``
provides the corresponding information.

.. code:: r

   # distances between the POIs
   dist_m <- st_distance(POI)
   rownames(dist_m) <- POI$Objekt
   colnames(dist_m) <- POI$Objekt

   # distance of POIs to DIW Berlin
   dist_m['DIW Berlin', ]

::

   ## Units: [m]
   ##         Brandenburger Tor              Hauptbahnhof                DIW Berlin 
   ##                  870.8119                 1941.2940                    0.0000 
   ##              Mulecule Man           Flughafen Tegel          Tempelhofer Feld 
   ##                 5074.7951                 8488.2154                 3751.7696 
   ##          Hufeisensiedlung                 FU Berlin                   Wannsee 
   ##                 8202.7627                10269.1064                17307.5100 
   ##         Zitadelle Spandau          Schloss Köpenick Zentraler Omnibus Bahnhof 
   ##                12364.7837                14651.8215                 7422.6480

.. code:: r

   # distance between each household and each POI
   dist_soep_poi <- st_distance(SOEP_BE, POI)
   dim(dist_soep_poi)

::

   ## [1] 529  12

.. code:: r

   # save distances in an object
   DIST <- as_tibble(dist_soep_poi)
   # add names
   names(DIST) <- str_c('distance_to_', str_remove(POI$Objekt, ' '))

   # attach distances to data
   SOEP_BE <- bind_cols(SOEP_BE, DIST)


   # area covered by Berlin
   st_area(BE)

::

   ## 893060962 [m^2]


.. _nearest:

**Nearest Point**


To find the point or feature closest to another one the function
``st_nearest_feature`` will return a vector with indices of the nearest
feature. In the example below we are looking for the households living
closest to the POIs in Berlin. In the second step we compute the
corresponding distances between the POI and the household.

.. code:: r

   nearest_hh <- st_nearest_feature(POI, SOEP_BE)

   diag(st_distance(POI, SOEP_BE[nearest_hh, ]))

::

   ## Units: [m]
   ##  [1]  256.5810 1195.8833  789.6812  392.5787 1913.8130  891.4404 1295.8621
   ##  [8]  520.3830  971.0417  364.4532  250.2833 1194.7832

.. code:: r

   BE %>% # polygon for Berlin
     ggplot() + geom_sf(fill = 'white') + # plot the Berlin polygon
     geom_sf(data = POI) + # add the POIs
     geom_sf(data = SOEP_BE[nearest_hh, ], col = 'red') # add the nearest household


.. figure:: png/nearest-1.png
   :align: center  
   

**Within Distance**


If your interest is about which households live within a certain
distance to a specific point, ``st_is_within_distance`` lets you lookup
geometries in a given distance (argument ``dist``) and returns a list.
The below example looks up households in a 5km distance of the
Brandenburger Tor. The plot shows the 5km radius area in yellow, the
location of the Brandenburger Tor (black dot) and than households within
the distance (red dots).

.. code:: r

   r_5000 <- st_is_within_distance(POI[POI$Objekt == 'Brandenburger Tor',], 
                                   SOEP_BE,
                                   dist = 5000)
   r_5000

::

   ## Sparse geometry binary predicate list of length 1, where the predicate was `is_within_distance'
   ##  1: 1, 3, 6, 12, 21, 25, 28, 39, 54, 55, ...

.. code:: r

   BE %>% # polygon for Berlin
     ggplot() + geom_sf(fill = 'white') + # plot the Berlin polygon
     geom_sf(data = POI[POI$Objekt == 'Brandenburger Tor',]) + # add the POI
     geom_sf(data = SOEP_BE[unlist(r_5000), ], col = 'red') + # add the nearest household
     geom_sf(data = st_buffer(POI[POI$Objekt == 'Brandenburger Tor',], dist = 5000), 
             alpha = 0.2, fill = 'yellow')
             
.. figure:: png/withinDistance-1.png
   :align: center  
              

The question cal also be asked the other way around: How many POIs are
within a 5km radius of the SOEP households? This way the function
``st_is_within_distance`` returns a list of length equal to the number
of SOEP households in Berlin (529). For each household the (row) index
for the POI is given. To get the number of POIs in the 5km radius, we
can simply ask for the length (the number of row indices) of each
list-component. To get the according distances see section *Distances /
Areas*

.. code:: r

   poi_5000 <- st_is_within_distance(SOEP_BE, POI, dist = 5000)
   poi_5000

::

   ## Sparse geometry binary predicate list of length 529, where the predicate was `is_within_distance'
   ## first 10 elements:
   ##  1: 1, 2, 3, 6
   ##  2: (empty)
   ##  3: 1, 3, 4, 6, 7
   ##  4: 5
   ##  5: 2, 5, 12
   ##  6: 1, 2, 12
   ##  7: 10
   ##  8: 5, 10, 12
   ##  9: 8
   ##  10: 5, 10, 12

.. code:: r

   N_POI <- as_tibble(sapply(poi_5000, length))
   names(N_POI) <- 'n_poi_in_5km'

   SOEP_BE <- bind_cols(SOEP_BE, N_POI)


.. _joins:

**Spatial joins**


When you are used to working with SOEP data you will have probably
merged / joined data sets using the identifying variables (``cid``,
``hid``, ``pid``) and the survey year (``syear``) before. When you are
working with spatial data you will have to choose one of the geometry
predicate function provided by the ``sf`` package. The default is a left
join of the two data sets using ``st_intersects`` as the geometry
predicate function for joining. You can however change this, for
example, to join the nearest features, see section *Nearest Point*. In
our example here, we join the nearest SOEP household to each of the
points of interest. The geometry column here provides the coordinates
from the POI data set.

.. code:: r

   NEAR <- st_join(POI, SOEP, join = st_nearest_feature)
   NEAR

::

   ## Simple feature collection with 12 features and 3 fields
   ## geometry type:  POINT
   ## dimension:      XY
   ## bbox:           xmin: 783630.5 ymin: 5817059 xmax: 810721.7 ymax: 5831749
   ## projected CRS:  ETRS89 / UTM zone 32N
   ## First 10 features:
   ##               Objekt erhebj    ID                 geometry
   ## 1  Brandenburger Tor   2005 75759 POINT (796986.8 5827473)
   ## 2       Hauptbahnhof   2005 69076 POINT (796358.6 5828411)
   ## 3         DIW Berlin   2005 75759 POINT (797754.3 5827062)
   ## 4       Mulecule Man   2005 75646 POINT (802628.5 5825649)
   ## 5    Flughafen Tegel   2005 73266 POINT (790677.7 5831749)
   ## 6   Tempelhofer Feld   2005 69820 POINT (798787.2 5823455)
   ## 7   Hufeisensiedlung   2005 69477 POINT (802251.6 5820202)
   ## 8          FU Berlin   2005 67568 POINT (790892.2 5819422)
   ## 9            Wannsee   2005 67923 POINT (783630.5 5817059)
   ## 10 Zitadelle Spandau   2005 68827 POINT (785646.9 5829572)


**Export Results**


To export your results you can use ``st_write`` to create a ``.csv``
file. When exporting your results pleace check the requirements
`here <http://companion.soep.de/Working%20with%20SOEP%20Data/Working%20with%20SOEP%20IGEL.html#data-transfer-from-moran-to-hauser>`__.

.. code:: r

   path_export <- paste0('/home/', Sys.info()['user'], '/transfer/export/',  Sys.Date())

   if(!file.exists(path_export)){
     dir.create(path_export, recursive = TRUE)
   }

   st_write(SOEP_BE, file.path(path_export, 'Output_SOEP_BE.csv'), 
            append = FALSE, overwrite = TRUE)

   README <- tibble(name = names(SOEP_BE)[-grep('geometry', names(SOEP_BE))],
                    description = c('erhebj', 
                                    'ID', 
                                    'distance (in meters) of household to Brandenburger Tor',
                                    'distance (in meters) of household to Hauptbahnhof',
                                    'distance (in meters) of household to DIW-Berlin',
                                    'distance (in meters) of household to Mulecule Man',
                                    'distance (in meters) of household to Flughafen Tegel',
                                    'distance (in meters) of household to Tempelhofer Feld',
                                    'distance (in meters) of household to Hufeisensiedlung',
                                    'distance (in meters) of household to FU Berlin',
                                    'distance (in meters) of household to Wannsee',
                                    'distance (in meters) of household to Zitadelle Spandau',
                                    'distance (in meters) of household to Schloss Köpenick',
                                    'distance (in meters) of household to Zentraler Omnibus Bahnhof',
                                    'number of POIs within 5 km radius of household'))

   write.csv(README, file.path(path_export, 'Output_SOEP_BE.csv'), 
            row.names = FALSE)


.. _completeexample:

Complete Example
================

Suppose you want to know which households of the SOEP from survey year
2011 live within a distance of 5000m to the following points of interest
(POI):

-  Brandenburger Tor
-  Zitadelle Spandau
-  Wannsee

Besides that, you want to know how far their distance to the
corresponding POI is and which household lives closest to the
corresponding POI. After computing the informations need you want to
export the results for further use on the HAUSER server.

.. raw:: html

   <!-- to do: Eleganter ist's vielleicht hiermit -->

.. raw:: html

   <!-- https://bookdown.org/yihui/rmarkdown-cookbook/code-appendix.html -->

.. code:: r

   # Global stuff
   # ~~~~~~~~~~~~

   # packages
   library(here)
   library(sf)
   library(tidyverse)

   # global values
   survey_year <- 2011
   distance <- 5000 # meter
   common_crs <- 25832


   # Step 1: read the data
   # ~~~~~~~~~~~~~~~~~~~~~ 

   # read polygons for Federal States
   States <- st_read(here('Daten', 'vg250_ebenen_0101'), layer = 'VG250_LAN', quiet = TRUE)

   # read SOEP data
   SOEP <- st_read(here('Daten', 'soep_v29'), quiet = TRUE)

   # read POI data
   POI <- st_read(here('Daten', 'POIs_Berlin.csv'), quiet = TRUE)


   # Step 2: Transform data
   # ~~~~~~~~~~~~~~~~~~~~~~

   # transform SOEP the data
   SOEP <- st_transform(SOEP, crs = common_crs)

   # transform POIs
   POI <- POI[, -4]

   POI <- st_as_sf(POI, coords = c("X", "Y"), crs = 4326)
   POI <- st_transform(POI, crs = common_crs)


   # Step 3: Subset the data
   # ~~~~~~~~~~~~~~~~~~~~~~~

   # Berlin only 
   BE <- States %>% 
     filter(GEN == 'Berlin')

   # interesting POIs only
   POI <- POI %>% 
     filter(Objekt %in% c("Brandenburger Tor", "Zitadelle Spandau", "Wannsee"))

   # SOEP survey year: 2011 only 
   SOEP <- SOEP %>% 
     filter(erhebj == survey_year)

   # SOEP housholds in Berlin only 
   SOEP <- SOEP[unlist(st_contains(BE, SOEP)), ]


   # Step 4: Plot the data
   # ~~~~~~~~~~~~~~~~~~~~~

   BE %>% 
     ggplot() + geom_sf() +
     geom_sf(data = SOEP) + 
     geom_sf(data = POI, col = 'red') +
     geom_sf(data = st_buffer(POI, dist = distance), 
             alpha = 0.2, fill = 'yellow')
             
.. figure:: png/example-1.png
   :align: center  
             

.. code:: r

   # Step 5: Get the desired information 
   # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   # get the household in relevant distance
   soep_in_5km <- st_is_within_distance(POI, SOEP, dist = distance)
   SOEP_5km <- SOEP[unlist(soep_in_5km), ]

   # get the distances
   dist_matrix <- as_tibble(st_distance(SOEP_5km, POI))
   names(dist_matrix) <- str_c('distance_to_', str_remove(POI$Objekt, ' '))

   # attach distances to data
   SOEP_5km <- bind_cols(SOEP_5km, dist_matrix)

   # find the nearest household
   nearest_hh <- st_nearest_feature(POI, SOEP_5km)

   SOEP_5km$nearest_to <- ''
   SOEP_5km$nearest_to[nearest_hh] <- POI$Objekt


   # Step 6: Check your data
   # ~~~~~~~~~~~~~~~~~~~~~~~

   BE %>% 
     ggplot() + geom_sf() +
     geom_sf(data = SOEP) + 
     geom_sf(data = POI, col = 'red') +
     geom_sf(data = st_buffer(POI, dist = distance), 
             alpha = 0.2, fill = 'yellow') + 
     geom_sf(data = SOEP_5km, col = 'blue') +
     geom_sf(data = SOEP_5km[SOEP_5km$nearest_to != '', ], col = 'green')


.. figure:: png/example-2.png
   :align: center  
             

.. code:: r

   # Step 7: Export your data
   # ~~~~~~~~~~~~~~~~~~~~~~~~
   path_export <- paste0('/home/', Sys.info()['user'], '/transfer/export/',  Sys.Date())

   if(!file.exists(path_export)){
     dir.create(path_export, recursive = TRUE)
   }

   st_write(SOEP_5km, file.path(path_export, 'Output.csv'), 
            append = FALSE, overwrite = TRUE)

::

   ## Deleting layer `Output' using driver `CSV'
   ## Writing layer `Output' to data source `/home/hwsteinhauer/transfer/export/2021-01-27/Output.csv' using driver `CSV'
   ## Updating existing layer Output
   ## Writing 356 features with 6 fields and geometry type Point.

.. code:: r

   README <- tibble(name = names(SOEP_5km)[-grep('geometry', names(SOEP_5km))],
                    description = c('erhebj', 
                                    'ID', 
                                    'distance (in meters) of household to Brandenburger Tor',
                                    'distance (in meters) of household to Wannsee',
                                    'distance (in meters) of household to Zitadelle Spandau',
                                    'household nearest to the POI'))

   write.csv(README, file.path(path_export, 'README.csv'), 
            row.names = FALSE)


Appendix
========


**Session info - Platform**


::

   ##  setting  value                       
   ##  version  R version 4.0.3 (2020-10-10)
   ##  os       Ubuntu 20.04.1 LTS          
   ##  system   x86_64, linux-gnu           
   ##  ui       X11                         
   ##  language (EN)                        
   ##  collate  de_DE.UTF-8                 
   ##  ctype    de_DE.UTF-8                 
   ##  tz       Europe/Berlin               
   ##  date     2021-01-27


**Session info - Packages**


::

   ##  package     * version date       lib source        
   ##  bookdown    * 0.21    2020-10-13 [1] CRAN (R 4.0.3)
   ##  dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
   ##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 4.0.0)
   ##  ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.1)
   ##  gridExtra   * 2.3     2017-09-09 [1] CRAN (R 4.0.0)
   ##  here        * 1.0.0   2020-11-15 [1] CRAN (R 4.0.3)
   ##  kableExtra  * 1.3.1   2020-10-22 [1] CRAN (R 4.0.3)
   ##  knitr       * 1.30    2020-09-22 [1] CRAN (R 4.0.2)
   ##  pacman      * 0.5.1   2019-03-11 [1] CRAN (R 4.0.2)
   ##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
   ##  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.3)
   ##  rgdal       * 1.5-19  2021-01-05 [1] CRAN (R 4.0.3)
   ##  sessioninfo * 1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
   ##  sf          * 0.9-7   2021-01-06 [1] CRAN (R 4.0.3)
   ##  sp          * 1.4-4   2020-10-07 [1] CRAN (R 4.0.3)
   ##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
   ##  tibble      * 3.0.4   2020-10-12 [1] CRAN (R 4.0.3)
   ##  tidyr       * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
   ##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 4.0.0)
   ## 
   ## [1] /home/hwsteinhauer/R/x86_64-pc-linux-gnu-library/4.0
   ## [2] /usr/local/lib/R/site-library
   ## [3] /usr/lib/R/site-library
   ## [4] /usr/lib/R/library

Section author: Hans Walter Steinhauer hwsteinhauer@diw.de

Last updated: 2021-02-22